* Phishing definition:


  * “Phishing is a crime employing both social engineering and
    technical subterfuge to steal consumers’ personal identity data
    and financial account credentials”


* Still an active threat
  * 4th quarter of 2020: 225 000+ attacks


* Resurgence of phishing emails with COVID-19, after a drop in
  2019


Motivation (2) 





* Hypothesis: use of ready-to-deploy phishing kits,
leading to high similarity between kits


* Goal: reconstruct a “lineage” over the dataset, i.e. the
neighborhood links between kits that minimize the overall
modification effort


* Gain: find propagation paths of kits and identify
popular kits to rationalize countermeasure efforts
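
One way to read the goal above is as a minimum spanning tree problem: kits are nodes, pairwise distances are edge weights, and the tree keeps the links with the smallest total modification effort. The sketch below uses Prim's algorithm on a small made-up distance matrix. The kit names, the distances, and the choice of a spanning tree are illustrative assumptions, not necessarily the method used in this work.

```python
import heapq


def minimum_spanning_links(names, dist):
    """Prim's algorithm on a symmetric distance matrix.

    Returns (total_cost, links), where each link is (kit, neighbor, cost).
    """
    n = len(names)
    if n == 0:
        return 0, []
    visited = [False] * n
    visited[0] = True
    # Candidate links leaving the first kit.
    heap = [(dist[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    links, total = [], 0
    while heap and len(links) < n - 1:
        cost, i, j = heapq.heappop(heap)
        if visited[j]:
            continue  # this kit is already attached to the tree
        visited[j] = True
        links.append((names[i], names[j], cost))
        total += cost
        for k in range(n):
            if not visited[k]:
                heapq.heappush(heap, (dist[j][k], j, k))
    return total, links


# Hypothetical kits and token distances (assumption, for illustration only).
KITS = ["kit_a", "kit_b", "kit_c", "kit_d"]
DIST = [
    [0, 29, 40, 90],
    [29, 0, 12, 85],
    [40, 12, 0, 70],
    [90, 85, 70, 0],
]

total, links = minimum_spanning_links(KITS, DIST)
print(total, links)
```

A spanning tree is undirected, so orienting the links into ancestor and descendant would need extra evidence that this sketch does not model.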





Dataset 





* 20 871 kits analyzed


* ~340k PHP files, ~200k JS files, ~120k HTML files:
~182M LOC


* % of kits with: PHP (99.9%), JS (69.0%), HTML
(92.7%)

* PHP code can be found in non-PHP files or in obfuscated
strings, as code can be executed by eval(string)


* Other types: graphics, CSS ...
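
Because PHP can hide in non-PHP files and in strings passed to eval, the file extension alone is not a reliable language signal. The sketch below flags two simple textual indicators in a file's content. The regular expressions, the function name, and the sample files are hypothetical and much cruder than a real parser. Note also that `eval(` exists in JavaScript, so that flag is only a hint.

```python
import re

# "<?php" or "<?=" open tags; bare "<?" is skipped to avoid matching "<?xml".
PHP_OPEN_TAG = re.compile(r"<\?(?:php\b|=)", re.IGNORECASE)
# A call to eval(...), which may execute code stored in a string.
EVAL_CALL = re.compile(r"\beval\s*\(", re.IGNORECASE)


def php_indicators(text):
    """Return coarse textual hints that a file embeds PHP code."""
    return {
        "php_open_tag": bool(PHP_OPEN_TAG.search(text)),
        "eval_call": bool(EVAL_CALL.search(text)),
    }


# Hypothetical file contents (assumption, for illustration only).
SAMPLES = {
    "index.html": '<html><?php include "post.php"; ?></html>',
    "loader.php": "<?php eval(base64_decode($payload)); ?>",
    "style.css": "body { color: red; }",
}

REPORT = {name: php_indicators(body) for name, body in SAMPLES.items()}
for name, flags in REPORT.items():
    print(name, flags)
```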


Fragment similarity 





Type 1: identical clones 

Type 2: parametric clones 

Type 3: structurally similar clones¹
Type 4: semantically similar clones


Clones of types 1, 2, and 3 are considered


¹ Later referred to simply as “similar
clones”
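
The difference between type 1 and type 2 clones can be illustrated by comparing token sequences before and after replacing variables and literals with placeholders. The sketch below uses a deliberately simplified regex tokenizer for PHP-like code. The tokenizer, the function names, and the choice of what gets normalized are assumptions for illustration; a real analysis would rely on a proper parser.

```python
import re

# Simplified token classes for PHP-like code (assumption, not a full lexer).
TOKEN_SPEC = [
    ("comment", r"//[^\n]*"),
    ("string", r"\"[^\"]*\"|'[^']*'"),
    ("variable", r"\$[A-Za-z_]\w*"),
    ("number", r"\d+"),
    ("word", r"[A-Za-z_]\w*"),
    ("symbol", r"[^\s\w]"),
]
TOKEN = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokens(code, normalize=False):
    """Tokenize code, dropping comments and whitespace.

    With normalize=True, strings, variables and numbers are replaced by
    a placeholder naming their class.
    """
    out = []
    for match in TOKEN.finditer(code):
        kind, text = match.lastgroup, match.group()
        if kind == "comment":
            continue
        if normalize and kind in ("string", "variable", "number"):
            text = kind.upper()
        out.append(text)
    return out


def clone_type(a, b):
    """Classify a fragment pair as type 1, type 2, or neither."""
    if tokens(a) == tokens(b):
        return "type 1"
    if tokens(a, normalize=True) == tokens(b, normalize=True):
        return "type 2"
    return "neither"


print(clone_type("$res = $res * $i; // loop body", "$res  =  $res*$i;"))
print(clone_type('echo "This is a message";', 'echo "Another message";'))
print(clone_type('echo "Lower";', "return $res;"))
```

Type 3 clones are not covered here: they need a similarity measure rather than an equality test.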


Kits comparison 

[Figure: side-by-side PHP listings of two kits, Kit 1 with fragments 1 to 3 and a second kit with fragments 4 to 7; legible fragment labels include "Parametric" and "Original"]

* Fragments 1 and 4: function factorial($n), the same loop in both kits ($res = 1; for ($i = 2; $i <= $n; $i++) { $res = $res * $i; } return $res;)
* Fragments 2 and 5: echo "This is a message"; vs. echo "Another message";
* Fragments 3 and 6: if ($a == $b) ... elseif ($a > $b) echo "Greater"; else echo "Lower"; vs. if ($a == $b) ... else echo "Different";
* Fragment 7: function sum($a, $b) { ... }, present in the second kit only

Distance: sum of the L1 distances between fragment pairs

D = d(f1,f4) + d(f2,f5) + d(f3,f6) + d(f7, ∅)
  = 0 + 0 + 11 + 18
  = 29 tokens
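
The distance on this slide, a sum of L1 distances over fragment pairs counted in tokens, can be sketched as follows. Each fragment becomes a bag of token counts, string literals are collapsed so that parametric clones are at distance 0, and an unmatched fragment is compared with the empty fragment. The tokenizer and the fragments are simplified assumptions, so the values will not necessarily match those on the slide.

```python
import re
from collections import Counter

# Crude tokenizer: string literals, $variables, words, single symbols.
TOKEN = re.compile(r'"[^"]*"|\$\w+|\w+|[^\s\w]')


def token_counts(fragment):
    """Bag of tokens, with every string literal mapped to STRING."""
    counts = Counter()
    for tok in TOKEN.findall(fragment):
        if tok.startswith('"'):
            tok = "STRING"
        counts[tok] += 1
    return counts


def l1_distance(frag_a, frag_b=""):
    """L1 distance between token-count vectors; frag_b defaults to empty."""
    a, b = token_counts(frag_a), token_counts(frag_b)
    return sum(abs(a[t] - b[t]) for t in a.keys() | b.keys())


def kit_distance(pairs):
    """Sum of L1 distances over matched fragment pairs."""
    return sum(l1_distance(a, b) for a, b in pairs)


# Hypothetical fragments (assumption, for illustration only).
F2 = 'echo "This is a message";'
F5 = 'echo "Another message";'
FX = 'if ($a == $b) echo "Same"; elseif ($a > $b) echo "Greater"; else echo "Lower";'
FY = 'if ($a == $b) echo "Same"; else echo "Different";'
F7 = "function sum($a, $b) { return $a + $b; }"

D = kit_distance([(F2, F5), (FX, FY), (F7, "")])
print(D)
```

Here the parametric pair contributes 0, the modified conditional contributes the tokens of its extra branch, and the unmatched function contributes all of its tokens.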
[Figure: single-language analysis (PHP, JS, HTML). Diagram labels: kits comparison for each language; partitions; L1 distances; lineage; comparisons]

Figure 2 — Multi-language
Partitioning — Results (1)


[Figures 3 and 4, shared legend: kits with identical clone(s), with parametric clone(s), without clone]

Figure 3 — Multi-language kit
Figure 4 — PHP kit duplication
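
The three categories in the legend can be illustrated with a fingerprint-based partition: kits sharing the same fingerprint fall into the same class, first on whitespace-normalized source, then with literals masked. The fingerprint definition, the function names, and the sample kits are assumptions made for this sketch; it is not claimed to be the partitioning procedure behind the figures.

```python
import hashlib
import re
from collections import defaultdict

# String and integer literals, masked when looking for parametric clones.
LITERAL = re.compile(r'"[^"]*"' r"|'[^']*'" r"|\b\d+\b")


def fingerprint(files, normalize=False):
    """Hash a kit given as {path: source}, ignoring paths and file order."""
    digests = []
    for source in files.values():
        text = " ".join(source.split())  # collapse whitespace (simplification)
        if normalize:
            text = LITERAL.sub("LIT", text)
        digests.append(hashlib.sha256(text.encode("utf-8")).hexdigest())
    return tuple(sorted(digests))


def partition(kits, normalize=False):
    """Group kit names that share the same fingerprint."""
    groups = defaultdict(list)
    for name, files in kits.items():
        groups[fingerprint(files, normalize)].append(name)
    return sorted(sorted(group) for group in groups.values())


def label_kits(kits):
    """Label each kit with the strongest clone relation it takes part in."""
    labels = {name: "without clone" for name in kits}
    for group in partition(kits, normalize=True):
        if len(group) > 1:
            for name in group:
                labels[name] = "with parametric clone(s)"
    for group in partition(kits):
        if len(group) > 1:
            for name in group:
                labels[name] = "with identical clone(s)"
    return labels


# Hypothetical kits (assumption, for illustration only).
KITS = {
    "kit_a": {"login.php": '<?php echo "Welcome"; ?>'},
    "kit_b": {"index.php": '<?php  echo "Welcome";  ?>'},
    "kit_c": {"login.php": '<?php echo "Bienvenue"; ?>'},
    "kit_d": {"mail.php": "<?php mail($to, $subject, $body); ?>"},
}

LABELS = label_kits(KITS)
print(LABELS)
```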





Lineage — Visualization (multi-language)


Figure 6 — Inferred lineage of





Lineage — Focused Visualization (1)


Figure 7 — Focused visualization of the multi-language lineage (1)


Lineage — Focused Visualization (2)
Figure 8 — Focused visualization of the multi-language lineage (2)


Threats to validity 





* Validation of phishing kits

* Selection bias, limited by our approach
* Obfuscated files

* No oracle

* Sensitivity to the parser and the distance

* Ignored languages (CSS)
* Summary:
  Static similarity analysis of the source code of 20 000+ kits, totaling 180M+ LOC
  Highly similar kits, consistent with the kit propagation hypothesis
  Helps identify new kits as related to already known ones
  Finds nearest-neighbor propagation paths of kits
* Future work:
  Add new features (comments, dates, obfuscation patterns...)
  Include new languages (CSS)
  Incremental approach (forensic analysis)





Thank you for listening,
I'll be glad to answer
any questions.











